18 research outputs found

    RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning

    The technology-push of die stacking and application-pull of Big Data machine learning (BDML) have created a unique opportunity for processing-near-memory (PNM). This paper makes four contributions: (1) While previous PNM work explores general MapReduce workloads, we identify three workload characteristics: (a) irregular-and-compute-light (i.e., perform only a few operations per input word, which include data-dependent branches and indirect memory accesses); (b) compact (i.e., the computation has a small amount of intermediate live data and uses only a small amount of contiguous input data); and (c) memory-row-dense (i.e., process the input data without skipping over many bytes). We show that BDMLs have, or can be transformed to have, these characteristics which, except for irregularity, are necessary for bandwidth- and energy-efficient PNM, irrespective of the architecture. (2) Based on these characteristics, we propose RowCore, a row-oriented PNM architecture, which (pre)fetches and operates on entire memory rows to exploit BDMLs' row-density. In contrast to this row-centric access and compute schedule, traditional architectures only opportunistically improve row locality while fetching and operating on cache blocks. (3) RowCore employs well-known MIMD execution to handle BDMLs' irregularity, and sequential prefetch of input data to hide memory latency. In RowCore, however, one corelet prefetches a row for all the corelets, which may stray far from each other due to their MIMD execution. Consequently, a leading corelet may prematurely evict the prefetched data before a lagging corelet has consumed the data. RowCore employs novel cross-corelet flow-control to prevent such eviction. (4) RowCore further exploits its flow-controlled prefetch for frequency scaling based on novel coarse-grain compute-memory rate-matching, which decreases (increases) the processor clock speed when the prefetch buffers are empty (full). Using simulations, we show that RowCore improves performance and energy by 135% and 20% over a GPGPU with prefetch, and by 35% and 34% over a multicore with prefetch, when all three architectures use the same resources (i.e., number of cores and on-processor-die memory) and identical die-stacking (i.e., GPGPUs/multicores/RowCore and DRAM).
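
    The rate-matching in contribution (4) reduces to a simple occupancy-driven feedback loop on the corelet clock. The Python sketch below illustrates that policy under assumed thresholds, clock bounds, and step size; it is a minimal sketch of the stated idea, not RowCore's actual hardware mechanism.

```python
# Minimal sketch of coarse-grain compute-memory rate-matching: lower the
# corelet clock when the prefetch buffers run empty (compute is outrunning
# memory) and raise it when they run full (memory is outrunning compute).
# Bounds, step size, and thresholds are assumed values for illustration.

F_MIN_GHZ, F_MAX_GHZ, F_STEP_GHZ = 0.5, 2.0, 0.1

def rate_match(clock_ghz, occupancy, low=0.25, high=0.75):
    """Return the next clock speed given prefetch-buffer occupancy in [0, 1]."""
    if occupancy < low:    # buffers nearly empty: slow compute down
        return max(F_MIN_GHZ, clock_ghz - F_STEP_GHZ)
    if occupancy > high:   # buffers nearly full: speed compute up
        return min(F_MAX_GHZ, clock_ghz + F_STEP_GHZ)
    return clock_ghz       # in between: hold the clock steady

# Example: a corelet at 1.0 GHz finding its buffers only 10% full steps down.
print(rate_match(1.0, 0.10))   # -> 0.9
```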

    Dark Silicon is Sub-Optimal and Avoidable

    Several recent papers argue that, due to the slowing of Dennard scaling of the supply voltage, future multicore performance will be limited by dark silicon, where an increasing number of cores are kept powered down due to lack of power. Customizing the cores to improve power efficiency may incur increased effort for hardware design, verification, and test, and degraded programmability. In this paper, we show that dark silicon is sub-optimal in performance and avoidable, and that a gentler, evolutionary path for multicores exists. We make the key observations that (1) previous papers examine voltage-frequency-scaled designs on the power-performance Pareto frontier, whereas the frontier extends to a new region, derived by frequency scaling alone, where voltage-scaled designs are infeasible; and (2) because memory latency improves only slowly over generations, the performance of future multicores' workloads will be dominated by memory latency. Guided by these observations and a simple analytical model, we exploit (1) the sub-linear impact of clock speed on performance in the presence of memory latency, and (2) the super-linear impact of throughput on queuing delays. Accordingly, we propose an evolutionary path for multicores, called successive frequency unscaling (SFU). Compared to dark silicon, SFU keeps powered significantly more cores, running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. The higher active core count enables more memory-level parallelism, non-linearly offsetting the slower clock and resulting in more performance than that of dark silicon. For memory-intensive workloads, full SFU, where all the cores are powered up, performs 81% better than dark silicon at the 11 nm technology node. For enterprise workloads, where both throughput and response times are important, we employ controlled SFU (C-SFU), which moderately slows down the clock and powers many, but not all, cores to achieve 29% better throughput than dark silicon at the 11 nm technology node. The higher throughput non-linearly reduces queuing delays and thereby compensates for the slower clock, keeping C-SFU's total response latency within +/- 10% of that of dark silicon.
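
    The abstract's argument follows from a simple analytical model: per-task time has a clock-scaled compute term plus a memory-stall term that the clock cannot touch, while frequency scaling alone keeps per-core power roughly linear in clock speed. The toy Python model below works through one such trade-off; every constant, and the power model itself, is an illustrative assumption rather than the paper's actual model.

```python
# Toy model illustrating why full SFU can beat dark silicon on memory-
# intensive workloads: many slow cores supply more memory-level parallelism,
# and the fixed memory-stall term makes fast clocks pay off sub-linearly.
# All numbers below are made up for illustration.

DYN_W_PER_GHZ = 1.0   # dynamic power per GHz per core (assumed)
STATIC_W = 0.5        # static power per core in watts (assumed)
BUDGET_W = 64.0       # chip power budget (assumed)
F_MAX_GHZ = 3.0       # nominal full-speed clock (assumed)
N_TOTAL = 64          # cores on the die (assumed)

def time_per_task(f_ghz, compute_gcycles=1.0, mem_stall_s=0.5):
    # Seconds per task: compute shrinks with clock speed, but the memory-
    # stall term does not, making performance sub-linear in f.
    return compute_gcycles / f_ghz + mem_stall_s

def core_power(f_ghz):
    # Frequency scaling alone (voltage fixed): power grows only linearly in f.
    return DYN_W_PER_GHZ * f_ghz + STATIC_W

# Dark silicon: power up only as many full-speed cores as the budget allows.
n_dark = int(BUDGET_W // core_power(F_MAX_GHZ))
tput_dark = n_dark / time_per_task(F_MAX_GHZ)

# Full SFU: power up all cores, unscaling the clock to fit the same budget.
f_sfu = (BUDGET_W / N_TOTAL - STATIC_W) / DYN_W_PER_GHZ
tput_sfu = N_TOTAL / time_per_task(f_sfu)

print(f"dark silicon: {n_dark} cores @ {F_MAX_GHZ:.1f} GHz -> {tput_dark:.1f} tasks/s")
print(f"full SFU:     {N_TOTAL} cores @ {f_sfu:.1f} GHz -> {tput_sfu:.1f} tasks/s")
```

    With these made-up numbers, 64 unscaled cores at 0.5 GHz deliver about 25.6 tasks/s versus about 21.6 tasks/s for 18 full-speed cores, because the memory-stall term dominates either way and the extra cores overlap more of it.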

    apSLIP: A High-performance Adaptive-Effort Pipelined Switch Allocator

    Switch allocation and queuing discipline have a first-order impact on network performance and hence on overall system performance. Unfortunately, there is a fundamental tension between the quality of switch allocation and clock speed. On one hand, sophisticated switch allocators such as iSLIP include dependencies that make pipelining hard. On the other hand, simpler allocators which are pipelineable (and hence amenable to fast clocks) degrade throughput. This paper proposes apSLIP, which uses three novel ideas to adaptively pipeline iSLIP at fast clocks. To address the dependence between the grant and request stages in iSLIP, we allow superfluous requests to occur and leverage the VOQ architecture, which naturally ensures that the corresponding grants can be put to use. To address the dependence between the reading and updating of priority counters in iSLIP, we use stale priority values and resolve the resulting double booking by privatizing the priority counters and separating the arbitration into odd and even streams. Further, we observe that while iSLIP can exploit multiple iterations to improve its matching strength, such additional iterations deepen the pipeline and add to the network latency. The improved matching strength helps high-load scenarios, whereas the increased latency hurts low-load cases. Therefore, we propose an adaptive-effort pipelined iSLIP, apSLIP, which adapts between one iteration (shallow pipeline) at low loads and two iterations (deep pipeline) at high loads. Simulations reveal that, compared to an aggressive 2-cycle router, apSLIP improves, on average, end-to-end packet latency in an 8x8 network by 43% and high-load application performance in a 3x3 network by 19%, without affecting the low-load benchmarks.
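
    The adaptive-effort policy amounts to a load-triggered choice of pipeline depth. A minimal sketch follows, assuming average VOQ occupancy as the load signal and made-up thresholds; the paper's actual trigger and thresholds may differ.

```python
# Minimal sketch of apSLIP's adaptive-effort policy: one iSLIP iteration
# (shallow pipeline, lower latency) at low load, two iterations (deeper
# pipeline, stronger matching) at high load. The load signal and thresholds
# are assumptions; the gap between the thresholds provides hysteresis so
# the allocator does not oscillate at the boundary.

def choose_iterations(voq_occupancy, current_iters, low=0.3, high=0.5):
    """Pick 1 or 2 allocation iterations from average VOQ occupancy in [0, 1]."""
    if voq_occupancy > high:
        return 2               # high load: matching strength dominates
    if voq_occupancy < low:
        return 1               # low load: pipeline latency dominates
    return current_iters       # in the hysteresis band: keep current depth
```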

    Deadline-aware datacenter tcp (D2TCP)

    This Note analyzes the scope of appellate review that should be accorded to a trial judge's determination of nonobviousness. Part I details the condition of nonobviousness and how it has evolved into the principal obstacle to patentability. Part II analyzes the Supreme Court and appellate precedents on the scope of review on this issue. Part III evaluates the policy underpinnings of Rule 52(a) and applies a two-pronged analysis to the nonobviousness requirement to determine whether the clearly erroneous standard of review is appropriate. This Note concludes that the treatment of the nonobviousness determination as a question of law cannot be justified on either analytical or policy grounds, and that the determination should be treated as a question of fact subject to the clearly erroneous standard.

    Transient-fault recovery for chip multiprocessors

    No full text

    Speculative Versioning Cache

    No full text

    A Dynamic Approach to Improve the Accuracy of Data Speculation

    No full text